Exploratory Data Analysis

This exploratory data analysis (EDA) aims to uncover insights from two combined datasets—one containing lagged sentiment data from the second half of the previous year and the other with current sentiment data from the first half of this year. By integrating customer data with sentiment scores, this analysis seeks to explore how customer sentiment correlates with key variables such as Customer Lifetime Value, coverage types, financial metrics, and policy details across different states and time periods. Through the lens of five key research quesßtions, the analysis will identify trends, patterns, and relationships to provide a deeper understanding of the interplay between customer sentiment and various attributes, paving the way for actionable insights.

Question

How do state-wise sentiment trends change over time within the selected time periods?

The plot illustrates the sentiment trends for five states over the period from June 2023 to July 2024. Nevada consistently exhibits the most negative sentiment, particularly in October and November 2023, where its sentiment scores are significantly lower than other states. In contrast, California tends to show relatively positive sentiment throughout the timeline, with peaks in July 2023 and smaller variations compared to other states. Arizona shows fluctuating sentiment, occasionally trending upwards to positive values in February 2024. States like Oregon and Washington exhibit moderately negative sentiment, though Washington stands out with occasional positive spikes, such as in October 2023. Overall, the months of October and November 2023 demonstrate downward sentiment trends across most states, while February 2024 shows improvement for several states, such as Nevada and Arizona. Similar patterns are seen between Oregon and Washington, which generally move in the same direction during several months.

This table above summarizes the distribution of sentiment categories (Negative, Neutral, Positive, Strong Negative, Strong Positive) for each state based on auto insurance-related sentiment data. California, Oregon, and Washington have the highest counts of negative sentiments, implying potential dissatisfaction in these regions. Nevada is the only state with presence of strong positive sentiments, but it is also the state with the presense of highest negative. This suggests that Nevada auto insurance sentiments tend to change rapidly and aggressively from one month to another. Oregon and Arizona exhibit relatively balanced counts among the categories, with more positive sentiments than strong extremes. Overall, all five states have higher average negative sentiments for each month within the year than the positive ones, which suggests all states more dissatisfied auto insurance related customers than the satisfied ones. When aggregating the above table

The above shows the percentage of positive sentiment among Reddit commenters across five states. While all five states predominantly exhibit negative sentiment overall, Nevada stands out with the highest positive sentiment at 45.45%, suggesting relatively more favorable discussions. In contrast, California has the lowest positive sentiment at just 9.09%, indicating limited positivity in online conversations.

Summary of Customer Metrics by Marital Status
Marital.Status avg_premium number_of_customers avg_claim_amount avg_income
Divorced 127.30 524 556.37 57178.83
Married 125.80 2153 512.37 60943.99
Single 124.89 990 745.57 30615.05

This table summarizes customer metrics by marital status:

This suggests that marital status significantly influences these financial metrics.

Total Claim Amount by State
State total_claim_amount
Arizona 414801.2
California 739033.1
Nevada 193150.1
Oregon 580152.2
Washington 205650.9

The table above shows that California has the highest total claim amount at $739,033.1, followed by Oregon at $580,152.2. Nevada and Washington have relatively lower claim amounts, with Arizona being mid-range among the states.

The chart shows the distribution of Customer Lifetime Value across vehicle classes and marital status. “Four-Door Cars” and SUVs are leading the customer preference, dominated by married customers who also represent a larger share of higher CLV tiers. Single customers are more equitably distributed across vehicle classes but less concentrated in high-value segments, while divorced customers have the smallest representation overall, with fewer high-value customers. Low and medium categories are leading most of the vehicle classes, which shows an opportunity to increase the high-value customers. The businesses can be targeted towards married customers in the high-value vehicle categories of “Four-Door Cars” and SUVs while exploring strategies for engaging and upselling single and divorced customers in niche categories such as “Luxury Cars” and “Luxury SUVs.”

Code
# Create a 'month_year' column for easy grouping
reddit_df['month_year'] = pd.to_datetime(reddit_df[['year', 'month']].assign(day=1))

# Group by subreddit, sentiment, and month_year, then count the occurrences
sentiment_counts = reddit_df.groupby(['subreddit', 'general_sentiment', 'month_year']).size().reset_index(name='count')

# Map sentiment to indices for Sankey diagram
sentiment_mapping = {
    'Strong Positive': 0,
    'Positive': 1,
    'Neutral': 2,
    'Negative': 3,
    'Strong Negative': 4
}

# Prepare the source, target, and values for the Sankey diagram
sources = []
targets = []
values = []
links_colors = []
labels = ['Strong Positive', 'Positive', 'Neutral', 'Negative', 'Strong Negative']

# Define color map for different sentiment flows (links)
sentiment_colors = {
    'Strong Positive': '#6baed6',  # Soft Blue
    'Positive': '#66c2a5',         # Light Green
    'Neutral': '#f7c09e',          # Soft Yellow
    'Negative': '#f58484',         # Light Coral
    'Strong Negative': '#c4a6d2'   # Lavender
}

# Create a list of subreddits
subreddits = sentiment_counts['subreddit'].unique()
for idx, subreddit in enumerate(subreddits):
    # Add nodes for subreddits
    labels.append(subreddit)
    
    # Iterate through sentiments
    for sentiment, sentiment_idx in sentiment_mapping.items():
        filtered_data = sentiment_counts[(sentiment_counts['subreddit'] == subreddit) &
                                         (sentiment_counts['general_sentiment'] == sentiment)]
        
        for _, row in filtered_data.iterrows():
            sources.append(sentiment_idx)
            targets.append(len(sentiment_mapping) + idx)  # Subreddit node index
            values.append(row['count'])
            links_colors.append(sentiment_colors[sentiment])

# Create Sankey diagram
fig = go.Figure(go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=labels,
        color="lightgray"
    ),
    link=dict(
        source=sources,
        target=targets,
        value=values,
        color=links_colors
    )
))

fig.update_layout(title_text="Sentiment Flow Across Subreddits", font_size=10)
fig.show()
Warning: `includeHTML()` was provided a `path` that appears to be a complete HTML document.
✖ Path: /Users/parsakeyvani/Downloads/reddit_viz1.html
ℹ Use `tags$iframe()` to include an HTML document. You can either ensure `path` is accessible in your app or document (see e.g. `shiny::addResourcePath()`) and pass the relative path to the `src` argument. Or you can read the contents of `path` and pass the contents to `srcdoc`.

This Sankey diagram visualizes the flow of sentiments across three subreddits—cars, legaladvice, and personalfinance—from June 2023 to July 2024. The diagram is divided into five sentiment categories: Strong Positive, Positive, Neutral, Negative, and Strong Negative. Legaladvice consistently exhibits a high volume of Strong Negative sentiment, reflecting that many users in this subreddit are seeking help for difficult or negative situations. Despite the predominance of negative sentiment, there are fluctuations in Positive and Neutral sentiments, suggesting that while a significant portion of discussions is focused on negative experiences, there are also moments of positive or neutral advice shared.

In the Personalfinance subreddit, Strong Positive sentiment takes up a larger share, indicating that conversations are primarily centered around positive financial experiences and achievements. However, there also exists negative sentiments, revealing that not all discussions in this subreddit are positive. The Cars subreddit shows relatively lower activity compared to the other two, suggesting limited engagement in the topic.

This plot offers valuable insights into how sentiments evolve for each subreddit during the time period, helping to identify the most prominent sentiment categories and trends throughout the year.

Code
from plotly.subplots import make_subplots

# Create a 'month_year' column for easy grouping
reddit_df['month_year'] = pd.to_datetime(reddit_df[['year', 'month']].assign(day=1))

# Group by 'month_year', 'subreddit', and 'general_sentiment', and count occurrences
sentiment_counts = reddit_df.groupby(['month_year', 'subreddit', 'general_sentiment']).size().reset_index(name='count')

# Pivot data to have sentiment types as columns for each month
pivot_data = sentiment_counts.pivot_table(index=['month_year', 'subreddit'], columns='general_sentiment', values='count', aggfunc='sum', fill_value=0)

# Reset index for easy manipulation
pivot_data = pivot_data.reset_index()

# Calculate total counts by each subreddit (sum across all sentiments and months)
total_counts = pivot_data.groupby('subreddit').sum().sum(axis=1)

# Normalize the counts by dividing the counts for each sentiment and month by the total count for each subreddit
for sentiment in pivot_data.columns[2:]:  # Start from the sentiment columns (ignore 'month_year' and 'subreddit')
    pivot_data[sentiment] = pivot_data.apply(
        lambda row: row[sentiment] / total_counts[row['subreddit']] if total_counts[row['subreddit']] > 0 else 0, axis=1
    )

# Number of months (should be the same for each subreddit)
num_months = len(pivot_data['month_year'].unique())

# Define angles (uniformly distributed around the circle)
angles = np.linspace(0.0, 2 * np.pi, num_months, endpoint=False)

# Define the sentiment types
sentiments = ['Strong Positive', 'Positive', 'Neutral', 'Negative', 'Strong Negative']

# Define a custom color palette for each subreddit
subreddit_colors = {
    'cars': '#6baed6',  # Soft Blue
    'legaladvice': '#66c2a5',         # Light Green
    'personalfinance': '#f7c09e'     # Soft Yellow
}

# Create subplots: 5 subplots, each for one sentiment type
fig = make_subplots(
    rows=1, cols=5,  # 1 row, 5 columns
    subplot_titles=sentiments,
    shared_yaxes=True,  # Share the y-axis across subplots
    horizontal_spacing=0.09,
    specs=[[{'type': 'polar'}]*5]  # Specify polar subplot type
)

# Loop over sentiment types and plot each one in its own subplot
for idx, sentiment in enumerate(sentiments):
    # Filter data for the current sentiment
    data = pivot_data[['month_year', 'subreddit', sentiment]]
    
    # Ensure the data for months is aligned with the number of months
    for subreddit in pivot_data['subreddit'].unique():
        sentiment_data = data[data['subreddit'] == subreddit][sentiment].values
        if sentiment_data.shape[0] < num_months:  # In case some months are missing for the sentiment
            sentiment_data = np.pad(sentiment_data, (0, num_months - sentiment_data.shape[0]), constant_values=0)
        
        # Plot sentiment data for the current sentiment and subreddit
        fig.add_trace(
            go.Scatterpolar(
                r=sentiment_data,
                theta=data['month_year'].dt.strftime('%b %Y').unique(),
                mode='lines',
                name=subreddit,
                line=dict(color=subreddit_colors.get(subreddit, '#333333'), width=3),  # Default color if subreddit not in the palette
                hovertemplate=(
                    f'Sentiment: {sentiment}<br>'  # Display the sentiment type
                    f'Subreddit: {subreddit}<br>'  # Display the subreddit
                    'Count: %{r}<br>'  # Display the sentiment count (r)
                    '<extra></extra>'  # Hide the default trace name
                )
            ),
            row=1, col=idx+1  # Place each trace in the corresponding subplot
        )

# Update layout and axis settings
fig.update_layout(
    title='Sentiment Trends Across Subreddits (June 2023 - July 2024)',
    showlegend=True,
    legend=dict(
        x=1.05,  # Move the legend to the right outside the plot
        y=0.5,   # Center the legend vertically
        traceorder='normal',
        font=dict(size=12),
        borderwidth=1,
        bordercolor='Black'
    ),
    height=430,  # Set plot height
    width=1800,  # Set plot width
)

# Loop over each subplot to individually set the radial axis range for all subplots
for i in range(1, 6):  # Loop through all 5 subplots (polar subplots)
    fig.update_layout(
        **{
            f'polar{i}': {
                'radialaxis': {
                    'range': [0, 0.07],  # Set the same range for each subplot
                    'showticklabels': True,
                    'ticks': ''
                }
            }
        }
    )

# Show the plot
fig.show()
Warning: `includeHTML()` was provided a `path` that appears to be a complete HTML document.
✖ Path: /Users/parsakeyvani/Downloads/reddit_viz2.html
ℹ Use `tags$iframe()` to include an HTML document. You can either ensure `path` is accessible in your app or document (see e.g. `shiny::addResourcePath()`) and pass the relative path to the `src` argument. Or you can read the contents of `path` and pass the contents to `srcdoc`.

To gain a deeper understanding of sentiment trends across subreddits, we calculated the normalized sentiment counts for each subreddit and month. This normalization helped us compare the sentiment distributions across subreddits by adjusting for differences in total activity levels. Even after normalization, we observed that LegalAdvice exhibited a consistent dominance of strong negative sentiments, suggesting that discussions within this subreddit tend to be more focused on challenging or negative legal situations.

Interestingly, the Cars subreddit showed a more diverse range of sentiments, with strong positive, positive, and negative sentiments being notably present. This suggests the existence of two distinct groups within the community—those who are satisfied with their cars and those who are dissatisfied. These variations in sentiment could be influenced by factors such as the type of vehicle, the model, or individual experiences related to car ownership.

Overall, the results suggest that while conversations around LegalAdvice and PersonalFinance tend to maintain a steady sentiment, discussions about Cars reflect a broader emotional spectrum, possibly due to varying personal experiences related to vehicle ownership.

Code
# Group the data by 'state' and 'general_sentiment' to get the average sentiment and count
state_sentiment = reddit_df.groupby(['state', 'general_sentiment']).agg(
    post_count=('state', 'size'),
    avg_sentiment=('compound_score', 'mean')
).reset_index()

# Create a bubble chart
fig = px.scatter(state_sentiment, 
                 x='state', 
                 y='avg_sentiment', 
                 size='post_count', 
                 color='general_sentiment',
                 hover_name='state',
                 size_max=100,
                 title='Bubble Chart: State vs. Average Sentiment by Sentiment Type')

fig.update_layout(
    xaxis_title="State",
    yaxis_title="Average Sentiment Score",
    legend_title="Sentiment",
    showlegend=True,
    width=1000,
    height=600
)

fig.show()
Warning: `includeHTML()` was provided a `path` that appears to be a complete HTML document.
✖ Path: /Users/parsakeyvani/Downloads/reddit_viz3.html
ℹ Use `tags$iframe()` to include an HTML document. You can either ensure `path` is accessible in your app or document (see e.g. `shiny::addResourcePath()`) and pass the relative path to the `src` argument. Or you can read the contents of `path` and pass the contents to `srcdoc`.

In this bubble chart, we explore whether certain states have a higher number of posts or comments associated with specific sentiments, aiming to identify if any state stands out with a particular sentiment that is notably prevalent. What we observed is that California had the largest number of posts and comments across all sentiment categories, indicating that this state has a significantly higher volume of activity compared to others. This results in an imbalance in the number of posts and comments across the states, with California’s dominance clearly visible. The chart also reveals that most of the posts in California are associated with Strong Negative and Strong Positive sentiments, suggesting that discussions in this state tend to involve extreme emotional tones, whether positive or negative.

When looking at other states, we notice that Strong Negative sentiment is the most prevalent sentiment. This indicates that in all of the five states, the conversations are largely focused on negative experiences or challenging situations. In contrast to California, where strong positive and negative sentiments are more balanced, other states seem to lean more towards negative sentiments.

Overall, the plot provides a clear visualization of how sentiments are distributed across different states, with notable trends such as California’s high volume of extreme sentiments and the dominance of Strong Negative sentiment in most other states. This allows us to gain valuable insights into how different states engage with various emotional tones in online discussions throughout the year.

Code
# Plot the choropleth map for the selected year
fig = px.choropleth(
    state_monthly_sentiment, 
    locations='state_abb',
    locationmode='USA-states',
    color='compound_score', 
    color_continuous_scale="teal",
    hover_name='state_abb',
    animation_frame='year_month',  # This will animate through each month
    labels={'compound_score': 'Average Sentiment Score'},
    title="Monthly Sentiment Score by State (2023 & 2024)"
)

fig.update_geos(
    visible=True,
    showlakes=True,
    lakecolor="rgb(255, 255, 255)",
    projection_type="albers usa"
)

fig.show()
Warning: `includeHTML()` was provided a `path` that appears to be a complete HTML document.
✖ Path: /Users/parsakeyvani/Downloads/reddit_viz4.html
ℹ Use `tags$iframe()` to include an HTML document. You can either ensure `path` is accessible in your app or document (see e.g. `shiny::addResourcePath()`) and pass the relative path to the `src` argument. Or you can read the contents of `path` and pass the contents to `srcdoc`.

Lastly, we wanted to examine how sentiment scores varied across different states over time. To achieve this, we used a geoplot to visualize sentiment distribution by state, showing how sentiment shifted from one month to another. This plot provides a geographical view of sentiment scores across states, allowing us to identify regional trends and how they evolved over the year. By observing the colors representing sentiment intensities and changes over time, we can track fluctuations in sentiment across different areas, revealing patterns of positive, neutral, and negative sentiment across the country.

From the geoplot, we can see that California consistently exhibits high levels of Positive sentiment, particularly in the summer months, with a noticeable peak in June 2024. This trend suggests that conversations in California are generally more positive, especially around events or topics that likely resonate with residents during this time. Meanwhile, Nevada shows a more mixed sentiment distribution, with both Neutral and Negative sentiments appearing prominently throughout the year. Interestingly, Strong Negative sentiment rises slightly during the winter months, possibly reflecting seasonal or local events. In Oregon, Negative sentiment appears more frequently compared to other states, with Neutral sentiments fluctuating, especially around November 2023. The emotional tone in Oregon seems more balanced, with less drastic swings between positive and negative sentiments. Washington displays a slightly higher amount of Neutral sentiment, indicating that posts from this state are generally more balanced or indifferent in tone. The Strong Negative sentiment in Washington is notably low, suggesting fewer extreme negative emotions in the discourse compared to the other states. Finally, Arizona shows more Neutral sentiment across the year, with occasional spikes in Strong Positive sentiment, particularly in April 2024. The state’s sentiment distribution appears more stable, with fewer fluctuations in extreme emotional tones.